Abstract: Nowadays, people share their opinions over the Internet.
The growth of social websites has increased the volume of opinions
that people express about social and non-social entities and their
attributes. Mining these opinions and applying sentiment analysis to
them is a challenging task. Applying sentiment mining algorithms to
opinions is essential for classifying them and for providing a
knowledge base for information retrieval. The objective of this
article is to review sentiment mining approaches and to suggest some
recommendations for improving the sentiment analysis process for
different data sources.
I. Introduction

Sentiment mining, otherwise known as opinion mining, is the process
of extracting people's emotions, thoughts, ways of thinking,
mind-sets, points of view, attitudes, appraisals and feelings about
products, services, goods, topics, issues, organizations, etc.
People's sentiments are found in text data, datasets, the web, social
networks, opinion polls, debates and forums. The sentiment analysis
process comprises several levels of analysis: document level,
sentence level and phrase level. The system analyzes a user's review
and classifies it as either positive or negative. A data source
contains a large number of positive and negative comments. Opinions
may be expressed directly or indirectly, and the system has to deal
with both kinds of review and extract the opinion accurately.
Sentiment analysis involves several processes, such as subjectivity
detection, sentiment prediction, sentiment summarization, text
summarization, feature extraction and fake review detection.

II. Sentiment Mining 

Sentiment mining is a process for tracking the public's feelings
about certain products, services, goods, topics, issues,
organizations, etc. It also enables a machine to accumulate and
categorize opinions.

Opinion extraction is the task of finding the polarity of reviews.
Polarity is either positive or negative [20] [21]. Figure 1 shows the
opinion mining model.


Opinion: the feelings, judgments or mind-set of a user.

Opinion Holder: an organization or person who expresses an opinion
about an object.

Object: a real-world entity about which the opinion is expressed.

Figure 1. Opinion Model

III. Levels of Sentiment Mining 
A. Document Level 

In document level analysis, the whole document is considered for
mining. The whole document is classified as either positive or
negative about an object. Sentiment is classified on the basis of
opinion rather than topic. A single review about a topic is
considered in document level analysis [14].

Drawbacks: 

1. It does not determine what the person likes or dislikes about
the object.


ISSN - 2394-0573 


All Rights Reserved © 2016 IJEETE 




International Journal of Exploring Emerging Trends in Engineering (IJEETE) 
Vol. 03, Issue 02, Mar-Apr, 2016 Pg. 47-52 WWW.LTEETE.COM 


2. This level of sentiment classification is not suitable for blogs
and forums, which contain few opinionated sentences.

3. It defines only the polarity of the document. However, negative
words do not mean that the user dislikes everything, and positive
words do not mean that the user likes everything.

B. Sentence Level 

Unlike document level analysis, sentence level analysis considers
only sentences that contain opinionated terms and reports whether
those sentences are positive or negative [14].

Example: This is a good camera. 

Drawbacks: 

1. A user may express more than one feeling in a sentence. If a user
expresses both likes and dislikes of an object in the same sentence,
the system ranks the sentence as neutral, which does not convey what
the user really means.

2. If a user expresses liking in one sentence and disliking in
another, this leads to a negative result.
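
The first drawback can be illustrated with a minimal sketch. It assumes a simple word-counting classifier and a tiny hand-made lexicon (both hypothetical, not taken from the surveyed systems): a sentence that praises one aspect and criticizes another scores zero and is reported as neutral.

```python
# Hypothetical toy lexicon, for illustration only.
POSITIVE = {"good", "great", "love"}
NEGATIVE = {"bad", "poor", "hate"}

def sentence_polarity(sentence):
    """Classify one sentence by counting positive and negative words."""
    words = [w.strip(".,!?") for w in sentence.lower().split()]
    score = sum(w in POSITIVE for w in words) - sum(w in NEGATIVE for w in words)
    if score > 0:
        return "positive"
    if score < 0:
        return "negative"
    return "neutral"

print(sentence_polarity("This is a good camera."))                      # positive
print(sentence_polarity("The lens is great but the battery is poor."))  # neutral
```

The second sentence carries two clear opinions, yet the sentence level result hides both of them.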

C. Word Level 

Word level analysis, otherwise called phrase level analysis,
considers only the words of a sentence. It tokenizes a sentence into
words, extracts the keywords, and then determines whether the
keywords are positive or negative.

Example: This is a good phone [14].

Here, "good" is the keyword.
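
The tokenize-then-look-up procedure described above might be sketched as follows. The word lists are a hypothetical toy lexicon and the regular-expression tokenizer is an assumption; a real system would use a full sentiment lexicon and a proper tokenizer.

```python
import re

# Hypothetical toy lexicon; a real system would use a much larger resource.
POSITIVE_WORDS = {"good", "great", "excellent"}
NEGATIVE_WORDS = {"bad", "poor", "terrible"}

def tokenize(sentence):
    """Lowercase the sentence and split it into alphabetic word tokens."""
    return re.findall(r"[a-z']+", sentence.lower())

def keyword_polarities(sentence):
    """Return (keyword, polarity) pairs for every token found in the lexicon."""
    result = []
    for word in tokenize(sentence):
        if word in POSITIVE_WORDS:
            result.append((word, "positive"))
        elif word in NEGATIVE_WORDS:
            result.append((word, "negative"))
    return result

print(keyword_polarities("This is a good phone"))  # [('good', 'positive')]
```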

IV. Data Sources

A data source is the location where the data resides. A data source
may be a text file, a review site, a dataset, social media comments,
micro-blogging, blogs, the Google Play Android app store, etc.

A. Text File 

A set of positive or negative comments or opinion reviews is stored
in the form of a text file.

B. Review Sites 

A large number of review sites are available on the Internet. They
may contain reviews of products, software, restaurants, airlines,
etc. This form of data is used in most sentiment classification
work. Some e-commerce websites are www.amazon.com, www.yelp.com and
www.CNETdownload.com.

C. Dataset 

Datasets such as movie reviews, Amazon product reviews, weather
forecasting datasets and tumor cautions review datasets are
available on websites. Using such a dataset, a system can perform
sentiment polarity classification.

D. Social Media Comments 

Social media such as Facebook, LinkedIn and public forums contain
many people's opinions in the form of electronic text.

E. Micro Blogging 

Short messages sent by people are called micro-blogs. Twitter
messages, called "tweets", are used as a data source for the
sentiment analysis process. Such data sources are micro-blogging
sources.

F. Blogs 

A blog, otherwise called a web-log, is a collection of short
paragraphs of information, opinion or diary entries, called posts,
arranged in chronological order. Blogs are used to analyze the
public mood, movie sales information and sales in general.

G. Google Play Android App Store

This app store contains a huge number of reviews about apps, and the
ranking of each app is based on users' opinions/reviews.

V. Machine Learning Models 

Machine learning is the process of making a system learn on its own.
Machine learning algorithms make the system build a model from
sample data. When new data arrives, the system can make predictions
based on the model it has already learned [3] [6] [7] [16].

A. Supervised Learning 

Supervised learning takes as input a known training dataset with
labeled classes and constructs a model that generates predictions
for new data [3] [17].

Bayesian Classifier: Naive Bayes is a classification technique based
on Bayes' theorem from Bayesian statistics. It uses a probabilistic
approach to classification. Naive Bayes is otherwise called simple
Bayes or independence Bayes, and is sometimes called a Bayesian
classifier or idiot Bayes.

The Naive Bayes classifier has been used to classify text into
categories such as spam or ham, anger, happiness, sorrow, etc.
[1][4][8]. Naive Bayes performs well for classes with highly
dependent features [3]. This approach produces better accuracy when
the dimensionality of the input is high. For a document $t$ and a
class $R_i$,

$$\Pr(R_i \mid t) = \frac{\Pr(t \mid R_i)\,\Pr(R_i)}{\Pr(t)}$$

where $\Pr(R_i \mid t)$ is the posterior probability,
$\Pr(t \mid R_i)$ is the likelihood,
$\Pr(R_i)$ is the prior probability and
$\Pr(t)$ is a constant for the given $t$.
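
As a minimal sketch of how the posterior above is used for classification, the following code trains a word-count Naive Bayes model on four made-up reviews. The training sentences, the add-one (Laplace) smoothing and the function names are assumptions for illustration, not part of the surveyed work. Because Pr(t) is the same for every class, it is dropped and only the numerator is compared, in log space.

```python
import math
from collections import Counter, defaultdict

def train_naive_bayes(labeled_docs):
    """labeled_docs: list of (text, label). Returns priors, word counts, vocabulary."""
    class_counts = Counter()
    word_counts = defaultdict(Counter)
    vocab = set()
    for text, label in labeled_docs:
        class_counts[label] += 1
        for word in text.lower().split():
            word_counts[label][word] += 1
            vocab.add(word)
    total = sum(class_counts.values())
    priors = {c: n / total for c, n in class_counts.items()}
    return priors, word_counts, vocab

def classify(text, priors, word_counts, vocab):
    """Return the class R_i with the largest log Pr(t|R_i) + log Pr(R_i)."""
    best_label, best_score = None, -math.inf
    for label, prior in priors.items():
        total_words = sum(word_counts[label].values())
        score = math.log(prior)
        for word in text.lower().split():
            if word not in vocab:
                continue  # ignore words never seen in training
            # Laplace (add-one) smoothing avoids zero probabilities.
            p = (word_counts[label][word] + 1) / (total_words + len(vocab))
            score += math.log(p)
        if score > best_score:
            best_label, best_score = label, score
    return best_label

# Made-up training reviews, for illustration only.
training = [
    ("good camera great picture", "positive"),
    ("great phone good battery", "positive"),
    ("bad screen poor battery", "negative"),
    ("poor camera bad picture", "negative"),
]
model = train_naive_bayes(training)
print(classify("good picture", *model))  # positive
print(classify("bad battery", *model))   # negative
```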

Support Vector Machine: SVM was introduced at COLT-92 by Boser,
Guyon and Vapnik. SVM can be applied to classification or regression
and follows a kernel based approach. The SVM classifier has a
predetermined format for its input and output. The given input is
decomposed into word vectors. Using those vectors, SVM constructs a
hyperplane in N-dimensional space. The result is either +1 or -1:
for each input, SVM predicts the class $c_j \in \{1, -1\}$, which
corresponds to positive or negative [3][19].

With $c_j \in \{1, -1\}$ denoting the class of document $d_j$, the
solution can be represented by the weight vector

$$\vec{w} = \sum_j \alpha_j c_j \vec{d_j}, \quad \alpha_j \ge 0$$

where $\alpha_j$ is a multiplier.
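
The weight vector w = sum_j alpha_j c_j d_j can be illustrated with a small sketch. The document vectors, the class labels and the multipliers below are hand-picked, hypothetical values; in practice the multipliers come from solving the SVM optimization problem, which this sketch does not do.

```python
def weight_vector(doc_vectors, classes, multipliers):
    """Compute w = sum_j alpha_j * c_j * d_j for equal-length document vectors."""
    dim = len(doc_vectors[0])
    w = [0.0] * dim
    for d, c, alpha in zip(doc_vectors, classes, multipliers):
        for i in range(dim):
            w[i] += alpha * c * d[i]
    return w

def predict(w, doc_vector, bias=0.0):
    """Return +1 or -1 depending on the side of the hyperplane w.x + bias = 0."""
    activation = sum(wi * xi for wi, xi in zip(w, doc_vector)) + bias
    return 1 if activation >= 0 else -1

# Features: counts of the words ("good", "bad"); all values are hypothetical.
docs = [[2, 0], [1, 0], [0, 2], [0, 1]]
classes = [1, 1, -1, -1]
alphas = [0.0, 1.0, 0.0, 1.0]  # only the two support vectors have alpha > 0

w = weight_vector(docs, classes, alphas)
print(w)                   # [1.0, -1.0]
print(predict(w, [3, 1]))  # 1
print(predict(w, [0, 4]))  # -1
```

Only the documents closest to the separating hyperplane contribute to w; the others have a multiplier of zero.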

Maximum Entropy: Maximum entropy is a probabilistic classifier that
belongs to the class of exponential models. It does not assume that
the features are conditionally independent of each other. It is
based on the principle of maximum entropy. Maximum entropy uses
search based optimization to find weights for the features that
maximize the likelihood of the training data [3]. The probability of
a class $R$ given a document $t$ and weights $\lambda$ is

$$P(R \mid t) = \frac{\exp\left(\sum_a \lambda_a f_a(R, t)\right)}{\sum_{R'} \exp\left(\sum_a \lambda_a f_a(R', t)\right)}$$

where $f_a$ is the $a$-th feature function and the sum in the
denominator runs over all classes $R'$.
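
A minimal sketch of evaluating this probability is shown below. The two feature functions and their weights are hypothetical, and no training (weight search) is performed; the code only shows how the exponential scores are normalized over the classes.

```python
import math

def maxent_probabilities(document, classes, features, weights):
    """Return P(R | t) for every class R under an exponential model.

    features: list of functions f_a(R, t) -> number
    weights:  list of lambda_a, one per feature
    """
    scores = {
        R: math.exp(sum(lam * f(R, document) for lam, f in zip(weights, features)))
        for R in classes
    }
    normalizer = sum(scores.values())  # sum over all classes R'
    return {R: s / normalizer for R, s in scores.items()}

# Hypothetical binary feature functions.
def f_good_positive(R, t):
    return 1.0 if R == "positive" and "good" in t.split() else 0.0

def f_bad_negative(R, t):
    return 1.0 if R == "negative" and "bad" in t.split() else 0.0

probs = maxent_probabilities(
    "this is a good phone",
    classes=["positive", "negative"],
    features=[f_good_positive, f_bad_negative],
    weights=[1.5, 1.5],  # made-up weights; normally learned from training data
)
print(probs)
```

The probabilities always sum to one, and here the positive class receives the larger share because only the "good" feature fires.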


Boosting Algorithm: Boosting is a family of machine learning
meta-algorithms that convert weak learners into strong ones.
Boosting involves incrementally constructing an ensemble by training
each new model instance to emphasize the training instances that
previous models misclassified.

The most popular boosting algorithm is AdaBoost. It has been applied
to rule-based systems, Bayesian classifiers and decision trees. One
criticism of boosting is that it performs poorly when classifying
noisy data. Depending on the dataset, the choice of a boosting
algorithm may be unsuccessful [3].
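
The reweighting idea behind AdaBoost can be sketched in a few lines. The weak learner used here is a threshold rule ("decision stump") on a single numeric feature, and the six data points are made up; both are assumptions chosen so that no single stump can separate the classes but a weighted vote of three can.

```python
import math

def train_stump(xs, ys, weights):
    """Pick the threshold/sign on a 1-D feature with the lowest weighted error."""
    best = None
    for threshold in sorted(set(xs)):
        for sign in (1, -1):
            preds = [sign if x >= threshold else -sign for x in xs]
            error = sum(w for w, p, y in zip(weights, preds, ys) if p != y)
            if best is None or error < best[0]:
                best = (error, threshold, sign)
    return best

def adaboost(xs, ys, rounds=3):
    """Train an ensemble of stumps; ys must contain +1 / -1 labels."""
    n = len(xs)
    weights = [1.0 / n] * n
    ensemble = []
    for _ in range(rounds):
        error, threshold, sign = train_stump(xs, ys, weights)
        error = min(max(error, 1e-10), 1 - 1e-10)  # keep the log finite
        alpha = 0.5 * math.log((1 - error) / error)
        ensemble.append((alpha, threshold, sign))
        # Increase the weight of misclassified instances, decrease the rest.
        for i, (x, y) in enumerate(zip(xs, ys)):
            pred = sign if x >= threshold else -sign
            weights[i] *= math.exp(-alpha * y * pred)
        total = sum(weights)
        weights = [w / total for w in weights]
    return ensemble

def predict(ensemble, x):
    """Weighted vote of all stumps in the ensemble."""
    vote = sum(a * (s if x >= t else -s) for a, t, s in ensemble)
    return 1 if vote >= 0 else -1

xs = [1, 2, 3, 4, 5, 6]
ys = [1, 1, -1, -1, 1, 1]  # made-up labels that no single threshold separates
model = adaboost(xs, ys, rounds=3)
print([predict(model, x) for x in xs])  # [1, 1, -1, -1, 1, 1]
```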

Genetic Algorithms: John Holland invented Genetic Algorithms (GA) in
the 1960s and 1970s. GA follows a heuristic approach that imitates
natural selection and survival of the fittest. A solution in a
genetic algorithm is an x-bit chromosome that represents one
arrangement. Every chromosome has a fitness score, which measures
the accuracy of the classifier. The fitness score for n documents is

$$\mathrm{Fitness}(s) = \sum_{a=1}^{n} \sum_{b=1}^{m} \left[ \mathrm{sim}(a, a-b) + \mathrm{sim}(a, a+b) \right]$$

where m is a range that describes the size of the similarity
neighborhood for each document. Each iteration of the algorithm
consists of the following three steps:

1. Take two solutions x and y with higher fitness scores from the
set of all solutions.

2. Combine x and y using the crossover operator to produce a new
solution z.

3. Occasionally, mutate a solution by randomly exchanging two of
its documents [3].
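
The fitness score and the crossover and mutation steps might be sketched as follows. Several details are assumptions, since the text does not specify them: documents are represented by integer ids, the similarity function is a toy "same topic" test, neighbours that fall outside the arrangement are skipped, and the crossover keeps a prefix of x and fills the rest in the order of y.

```python
import random

def fitness(arrangement, sim, m):
    """Sum similarity between each document and its neighbours within distance m.

    Neighbours falling outside the arrangement are skipped (an assumption).
    """
    n = len(arrangement)
    total = 0.0
    for a in range(n):
        for b in range(1, m + 1):
            if a - b >= 0:
                total += sim(arrangement[a], arrangement[a - b])
            if a + b < n:
                total += sim(arrangement[a], arrangement[a + b])
    return total

def crossover(x, y, cut):
    """Keep x[:cut], then append the remaining documents in the order of y."""
    head = x[:cut]
    return head + [doc for doc in y if doc not in head]

def mutate(solution, rng):
    """Swap two randomly chosen documents and return the new arrangement."""
    i, j = rng.sample(range(len(solution)), 2)
    mutated = list(solution)
    mutated[i], mutated[j] = mutated[j], mutated[i]
    return mutated

# Hypothetical similarity: ids 10-19 share one topic, ids 20-29 another.
def same_topic(d1, d2):
    return 1.0 if d1 // 10 == d2 // 10 else 0.0

grouped = [10, 11, 20, 21]
interleaved = [10, 20, 11, 21]
print(fitness(grouped, same_topic, m=1))      # 4.0
print(fitness(interleaved, same_topic, m=1))  # 0.0
print(crossover(grouped, [21, 20, 11, 10], cut=2))  # [10, 11, 21, 20]
print(mutate(grouped, random.Random(0)))
```

Arrangements that place similar documents next to each other receive a higher fitness score, which is what selection then favours.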

Bayes Model: The Bayes model is otherwise known as a belief network
or Bayesian network. A Bayesian network is a probabilistic graphical
model represented by a directed acyclic graph (DAG). The DAG
represents a set of random variables (attributes) and the
dependencies among them. Each edge represents a conditional
dependency, and nodes that are not connected represent variables
that are conditionally independent of each other [1][3].

Hernandez and Rodriguez used a Bayesian network to describe a
real-world problem by characterizing it with three different but
related variables. They proposed a multi-dimensional Bayesian
network classifier.

K-Nearest Neighbor Algorithm: KNN is a non-parametric approach used
for regression and classification. KNN is a kind of lazy learner.
The input consists of the k nearest training examples. An object is
categorized by the majority vote of its neighbors. If k = 1, the
object is simply assigned the class of its single nearest neighbor
[8] [22].

The training examples are vectors in an n-dimensional feature space,
each with a known class label. KNN memorizes the vectors and their
known classes. k is a constant defined by the user, and an unlabeled
vector is classified by assigning it the most frequent label among
its k nearest training examples.
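
The voting rule can be written in a few lines. This is a minimal sketch that assumes Euclidean distance and a made-up two-feature representation (counts of positive and negative words per review); neither choice comes from the surveyed work.

```python
import math
from collections import Counter

def knn_classify(query, examples, k):
    """examples: list of (vector, label). Returns the majority label of the k nearest."""
    by_distance = sorted(examples, key=lambda ex: math.dist(query, ex[0]))
    votes = Counter(label for _, label in by_distance[:k])
    return votes.most_common(1)[0][0]

# Hypothetical training vectors: (positive word count, negative word count).
examples = [
    ((3, 0), "positive"),
    ((2, 1), "positive"),
    ((0, 3), "negative"),
    ((1, 2), "negative"),
    ((0, 2), "negative"),
]
print(knn_classify((2, 0), examples, k=3))  # positive
print(knn_classify((0, 1), examples, k=1))  # negative
```

There is no training step: the examples are simply stored, and all the work happens when a query arrives, which is why KNN is called a lazy learner.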

B. Unsupervised Learning

Unsupervised learning is the process of making a system group data
without prior knowledge. It is used to cluster the data into various
groups based on the distance or links between them. Unsupervised
learning algorithms are used whenever the input dataset has unknown
class labels [3] [9].

K-means Clustering Algorithm: MacQueen introduced the k-means
algorithm in 1967. K-means is a well-known algorithm for clustering
unlabeled data. It offers a simple way to partition unlabeled data
into a number of groups (assume k clusters). The algorithm groups n
data points into k clusters. It defines k centroids, one for each
cluster. Different initial positions of the centroids lead to
different results. Each point is then assigned to its closest
centroid. When all the points have been assigned, the first stage is
complete and an initial clustering is obtained. Next, k new
centroids are computed from the newly formed groups, and the same
assignment process is repeated between the new centroids and the
input dataset. The k centroids move little by little until no
further changes occur. The objective function is

$$J = \sum_{j=1}^{k} \sum_{i=1}^{n} \left\| x_i^{(j)} - c_j \right\|^2$$

where $c_j$ is the centroid of cluster $j$,
$\| x_i^{(j)} - c_j \|^2$ is the distance function, $J$ is the
objective function, $k$ is the number of clusters and $n$ is the
number of cases.
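
The assign-and-update loop can be sketched as below. The choice of the first k points as initial centroids and the six sample points are assumptions made to keep the example deterministic; practical implementations usually pick the initial centroids at random and rerun several times.

```python
import math

def kmeans(points, k, max_iterations=100):
    """Plain k-means; the first k points serve as initial centroids (an assumption)."""
    centroids = [list(p) for p in points[:k]]
    assignment = None
    for _ in range(max_iterations):
        # Assignment step: attach every point to its closest centroid.
        new_assignment = [
            min(range(k), key=lambda j: math.dist(p, centroids[j])) for p in points
        ]
        if new_assignment == assignment:
            break  # no point changed cluster, so the centroids are stable
        assignment = new_assignment
        # Update step: move each centroid to the mean of its cluster.
        for j in range(k):
            members = [p for p, a in zip(points, assignment) if a == j]
            if members:
                centroids[j] = [sum(c) / len(members) for c in zip(*members)]
    return centroids, assignment

def objective(points, centroids, assignment):
    """J = sum of squared distances between each point and its centroid."""
    return sum(math.dist(p, centroids[a]) ** 2 for p, a in zip(points, assignment))

# Made-up 2-D points forming two visible groups.
points = [(1.0, 1.0), (5.0, 5.0), (1.0, 2.0), (5.0, 6.0), (2.0, 1.0), (6.0, 5.0)]
centroids, assignment = kmeans(points, k=2)
print(assignment)  # [0, 1, 0, 1, 0, 1]
print(centroids)
print(objective(points, centroids, assignment))
```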


Fuzzy C-Means Algorithm: Dunn developed this algorithm in 1973 and
Bezdek developed it further in 1981. Fuzzy c-means groups unlabeled
data into clusters. In this algorithm, the same data object can
belong to more than one cluster. It uses membership levels to assign
the same data to various clusters. The objective function is as
follows:

$$J_m = \sum_{i=1}^{N} \sum_{j=1}^{C} u_{ij}^{m} \left\| x_i - c_j \right\|^2, \quad 1 < m < \infty$$

where $N$ is the number of data points, $C$ is the number of
clusters, $m$ is any value greater than 1, $u_{ij}$ is the
membership value and $c_j$ is the center of the cluster [10][18].
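
The text gives only the objective function; the sketch below adds the standard alternating update rules usually paired with it (memberships from relative distances, centers as membership-weighted means). The one-dimensional data, the starting centers and m = 2 are assumptions for illustration.

```python
def memberships(points, centers, m=2.0):
    """Return u[i][j], the degree to which 1-D point i belongs to cluster j."""
    exponent = 2.0 / (m - 1.0)
    u = []
    for x in points:
        distances = [abs(x - c) for c in centers]
        zeros = distances.count(0.0)
        if zeros:
            # The point sits exactly on a center: full membership there.
            row = [1.0 / zeros if d == 0.0 else 0.0 for d in distances]
        else:
            row = [
                1.0 / sum((d_j / d_k) ** exponent for d_k in distances)
                for d_j in distances
            ]
        u.append(row)
    return u

def update_centers(points, u, m=2.0):
    """Each center becomes the mean of all points weighted by u[i][j] ** m."""
    n_clusters = len(u[0])
    return [
        sum((u[i][j] ** m) * x for i, x in enumerate(points))
        / sum(u[i][j] ** m for i in range(len(points)))
        for j in range(n_clusters)
    ]

def objective(points, centers, u, m=2.0):
    """J_m = sum_i sum_j u_ij^m * (x_i - c_j)^2."""
    return sum(
        (u[i][j] ** m) * (x - c) ** 2
        for i, x in enumerate(points)
        for j, c in enumerate(centers)
    )

# Made-up 1-D data and starting centers.
points = [1.0, 2.0, 3.0, 8.0, 9.0, 10.0]
centers = [2.0, 9.0]
u = memberships(points, centers)
print(u[2])  # membership of the point 3.0 in each cluster
new_centers = update_centers(points, u)
print(objective(points, centers, u), objective(points, new_centers, u))
```

Unlike k-means, every point keeps a non-zero membership in each cluster unless it coincides with a center, and each row of memberships sums to one.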

C. Lexicon Based Approach 

The lexicon based approach relies on a sentiment lexicon, which is a
collection of precompiled words and expressions related to people's
mind-sets [5][11][18]. It does not require prior knowledge or prior
training for data classification. It is further divided into the
corpus based approach, the dictionary based approach and the manual
approach to finding sentiment polarity.

Manual Approach: The sentiment lexicon and its features are
constructed manually. This is a tedious, time-consuming and
impractical task.

Dictionary Based Approach: The dictionary based approach initially
collects a set of opinionated words manually (the seed list) and
then grows it by searching for synonyms and antonyms in
SentiWordNet, a dictionary or a thesaurus [12]. Newly found words
are added to the seed list, and the process is repeated until no new
words are found. Errors can be corrected manually. This approach
does not identify domain-specific opinionated words and their
orientation.
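
The seed-list expansion can be sketched as a simple worklist loop. The thesaurus below is a hypothetical in-memory stand-in for a resource such as a dictionary or thesaurus, and the rule that synonyms inherit a word's polarity while antonyms receive the opposite one is the usual convention, stated here as an assumption.

```python
def expand_seed_list(seeds, thesaurus):
    """Grow {word: polarity} by following synonym and antonym links until stable.

    thesaurus: {word: {"synonyms": [...], "antonyms": [...]}} (hypothetical data).
    """
    lexicon = dict(seeds)
    frontier = list(seeds)
    while frontier:  # stop when no new words are found
        word = frontier.pop()
        polarity = lexicon[word]
        opposite = "negative" if polarity == "positive" else "positive"
        entry = thesaurus.get(word, {})
        candidates = [(s, polarity) for s in entry.get("synonyms", [])]
        candidates += [(a, opposite) for a in entry.get("antonyms", [])]
        for related, label in candidates:
            if related not in lexicon:
                lexicon[related] = label
                frontier.append(related)
    return lexicon

# Toy thesaurus, for illustration only.
toy_thesaurus = {
    "good": {"synonyms": ["fine", "nice"], "antonyms": ["bad"]},
    "bad": {"synonyms": ["poor"], "antonyms": ["good"]},
    "nice": {"synonyms": ["pleasant"], "antonyms": []},
}
lexicon = expand_seed_list({"good": "positive"}, toy_thesaurus)
print(lexicon)
```

Starting from the single seed "good", the loop reaches "poor" and "pleasant" through two hops, which is how a small manual list grows into a usable lexicon.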

Corpus Based Approach: The corpus based approach relies on a corpus
of documents. Preparing a corpus requires a large number of words,
so this approach is not as effective as the dictionary based
approach. However, the corpus based approach does identify
domain-specific opinionated words and their orientation.

Figure 2. Text Mining Levels, Data Source and their Classification


VI. Conclusion and Future Work

The main aim of this survey is to discuss the various techniques
used for sentiment classification, the various levels used in the
mining process and the various kinds of data sources. These are
represented in Figure 2 above. Many organizations have put effort
into finding the best system for sentiment analysis. Some of the
algorithms give good results, but they still have many limitations.
This domain requires a scalable algorithm that classifies text
accurately. In future, the work may be extended to mine opinions in
all languages.